Abstract

This paper presents our work on the TREC
2008 blog distillation task. We introduce two new
methods based on blog site search using resource
selection, the framework we used for the
TREC 2007 blog distillation task. One is a new factor
that penalizes the topical diversity of a blog. The
other is a query expansion technique. We compare
the methods to strong baselines.

1  Introduction 

The TREC 2008 Blog Track is composed of four
tasks: the baseline adhoc retrieval task, the opinion
finding task, the polarized opinion task and the blog
distillation task. We participated in the blog distillation
task, whose goal is
‘Find feeds that are principally devoted to topic X’.
The blog distillation task differs from the other
tasks in that it finds feeds of relevant blogs, whereas
the other tasks find relevant postings satisfying a special
need, e.g., postings with positive opinions. A
blog usually has a feed, so finding a feed can be interpreted
as finding the blog or blog site that owns the
feed.

The TREC 2008 blog distillation task is very similar
to the TREC 2007 blog distillation task in terms
of the goal and the dataset. Therefore, we use the
same framework that we used last year, which is ‘blog
site search using resource selection’ [6]. The framework
consists of two factors: a basis factor for resource
selection and a supplementary factor handling
the topicality of blogs. This year, we suggest a new
supplementary factor. Further, we investigate a query
expansion technique for our framework.


2  Dataset  and  processing 

We  used  the  permalink  collection  (posting  collection) 
instead  of  the  feed  collection  because  of  the  following 
two  reasons.  First,  the  goal  of  feed  search  is  to  find  a 
blog  to  be  subscribed  to  through  the  feed  link  rather 
than  directly  getting  information  from  the  feed.  In 
other  words,  the  feed  search  task  can  be  considered 
as  a  blog  search  task.  Therefore,  we  do  not  need  to 
insist  on  using  only  the  feed  collection.  Second,  a  feed 
often  contains  only  a  small  amount  of  text  because  it 
is  a  short  summary  of  a  posting.  If  we  use  only  the 
feed  collection,  retrieval  performance  may  suffer  from 
the  sparsity  of  words  that  causes  a  word  mismatch 
problem. 

The  permalink  collection  was  processed  as  follows. 
We stripped HTML tags and stemmed the text with the
Krovetz stemmer. A stopword list was applied only at query
time.  We  did  not  perform  any  pre-processing  such 
as  filtering  out  splogs  and  non-English  blogs  because 
our  framework  was  shown  to  be  able  to  deal  with  such 
issues  in  the  TREC  2007  blog  distillation  task. 

Our  algorithms  retrieve  relevant  postings  or  blogs 
whereas  the  blog  distillation  task  requires  relevant 
feed IDs (FEEDNO) for the submitted runs. Therefore,
we had to convert the posting IDs or the blog
IDs  in  our  search  results  to  feed  IDs  (FEEDNO). 

3 Blog Site Search Using Resource Selection

A blog is a collection of its postings. That is, finding
relevant blog sites is similar to finding relevant collections.
This is a typical case where resource selection
in distributed information retrieval is used. Therefore,
we exploit resource selection techniques for this
task. We call this approach “blog site search using resource
selection”. Note that all described techniques
are based on language modeling retrieval techniques
[1].

3.1  Baseline 

We  review  two  resource  selection  techniques  for  the 
blog  distillation  task:  Global  Representation  and 
Pseudo-cluster Selection [5, 6].

Global Representation treats a blog as one big document:
all postings of the blog are concatenated
into  a  virtual  document.  We  can  rank  the  blogs  by 
a  language  model  learned  from  the  virtual  document 
as  follows: 

$$\phi_{GR}(Q, c_i) = P(Q \mid D_{c_i})$$

where $\phi_{GR}(Q, c_i)$ is a ranking function for Global
Representation, $Q$ is a query, and $D_{c_i}$ is a virtual
document for blog $c_i$. For Global Representation, we
need  to  make  an  index  for  the  virtual  documents.  We 
refer  to  the  virtual  document  collection  and  the  index 
as  a  blog  collection  and  a  blog  index,  respectively. 
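
As an illustration, Global Representation can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the Indri implementation used for our runs: the function names, the tokenized input format, the default value of the smoothing parameter and the probability floor for unseen terms are all hypothetical choices made for the example. It concatenates the postings of each blog into a virtual document and ranks blogs by the Dirichlet-smoothed log query likelihood.

```python
import math
from collections import Counter


def dirichlet_log_likelihood(query_terms, doc_terms, coll_tf, coll_len, mu=2500.0):
    """Return log P(Q|D) under a Dirichlet-smoothed unigram language model."""
    tf = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for term in query_terms:
        # Collection probability; the 1e-10 floor for unseen terms is an assumption.
        p_coll = max(coll_tf.get(term, 0) / coll_len, 1e-10)
        score += math.log((tf[term] + mu * p_coll) / (doc_len + mu))
    return score


def global_representation(query_terms, blogs, mu=2500.0):
    """blogs maps a blog id to a list of postings (each a list of tokens).

    Returns (blog id, log score) pairs sorted from best to worst.
    """
    # Concatenate all postings of a blog into one virtual document.
    virtual = {b: [t for post in posts for t in post] for b, posts in blogs.items()}
    coll_tf = Counter(t for doc in virtual.values() for t in doc)
    coll_len = sum(coll_tf.values())
    scores = {
        b: dirichlet_log_likelihood(query_terms, doc, coll_tf, coll_len, mu)
        for b, doc in virtual.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

In this toy setting, a blog whose postings all mention the query term outranks a longer blog in which the term is diluted by other topics, which is the behaviour the virtual document is meant to capture.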

Pseudo-cluster Selection is inspired by a clustering-based
resource selection technique which is known for
being effective [7]. We assume that if some postings of
a blog are highly ranked in the posting search result
for a query (a topic), then the postings virtually form
a topic-based cluster which we call a topic-dependent
pseudo-cluster. We can rank the pseudo-clusters by
using cluster-based retrieval techniques. Since many
cluster representation techniques are biased
by long documents or high-frequency terms in the
cluster, we use a geometric-mean cluster representation
technique which is known to be relatively free
from such biases [4] as follows:

$$\phi_{PCS}(Q, c_i) = \left( \prod_{j=1}^{K} P(Q \mid d_{ij}) \right)^{\frac{1}{K}}$$

where $\phi_{PCS}(Q, c_i)$ is a ranking function for Pseudo-cluster
Selection and $d_{ij}$ is the $j$th posting of blog $c_i$
in the top $N$ posting search result.

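Pseudo-cluster Selection requires $K$ postings for
each cluster. However, a blog does not always have
at least $K$ postings in the top $N$ posting search
result. Therefore, if the number of postings $k_i$ from
blog site $c_i$ in the ranked list is less than $K$, we
manipulate the representation so that an upper bound
is used instead: each missing posting is replaced by the
lowest-scoring retrieved posting of the blog, whose score
bounds the scores of the postings ranked below the top $N$.

$$d_{min} = \arg\min_{d_{ij}} P(Q \mid d_{ij})$$

$$\phi_{PCS}(Q, c_i) = \left( P(Q \mid d_{min})^{K - k_i} \prod_{j=1}^{k_i} P(Q \mid d_{ij}) \right)^{\frac{1}{K}}$$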

This technique uses an index for postings, i.e., documents
in the permalink collection. In this paper, we
call  the  permalink  collection  and  the  index  a  posting 
collection  and  a  posting  index,  respectively. 
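
The geometric-mean representation with padding described above can be sketched as follows. This is a minimal sketch that works in log space to avoid numerical underflow; the function name and the input format (a list of blog id and log query-likelihood pairs for the top N postings) are assumptions made for the example and are not part of the original system.

```python
from collections import defaultdict


def pseudo_cluster_selection(ranked_postings, K):
    """Score blogs from the top-N posting search result.

    ranked_postings: list of (blog_id, log P(Q|d)) pairs for the top-N postings.
    Returns a dict mapping blog_id to log phi_PCS(Q, c_i).
    """
    by_blog = defaultdict(list)
    for blog_id, log_p in ranked_postings:
        by_blog[blog_id].append(log_p)

    scores = {}
    for blog_id, logs in by_blog.items():
        top = sorted(logs, reverse=True)[:K]
        # If fewer than K postings were retrieved, each missing posting
        # takes the score of the lowest-scoring retrieved posting (d_min).
        padding = (K - len(top)) * min(top)
        # The log of a geometric mean is the arithmetic mean of the logs.
        scores[blog_id] = (sum(top) + padding) / K
    return scores
```

Blogs with no posting in the top N receive no score at all in this sketch, so they never enter the final ranking.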

Global Representation and Pseudo-cluster Selection
have their pros and cons. Global Representation
is very simple and handles the topicality of a blog well,
but it is easily dominated by long postings.
Pseudo-cluster Selection behaves like sampling several
relevant postings from a blog. As long as the blog
addresses a small number of topics, Pseudo-cluster
Selection works well. However, if a blog addresses
too many topics (for example, the blog contains thousands
of postings related to other topics and only ten
relevant postings), then this method may not work.

To tackle the weaknesses of both methods, we consider
combinations of the two methods. In the combination,
the superior method becomes a basis factor
and the other becomes a supplementary factor. The
basis factor is fixed whereas the supplementary factor
can sometimes be replaced by other techniques
depending on the task. In practice, Pseudo-cluster Selection
only uses an index for posting search, which
most blog publishing services already have. This
benefit in system implementation gives Pseudo-cluster
Selection some superiority. Furthermore, Seo
and Croft showed that Pseudo-cluster Selection outperformed
Global Representation in blog site search [5].
Therefore, Pseudo-cluster Selection is considered as
our basis factor. Accordingly, Global Representation
becomes the supplementary factor.

The  combination  is  computed  by  multiplication  of 
the  ranking  function  of  each  method  as  follows. 

$$\phi_{baseline}(Q, c_i) = \phi_{PCS}(Q, c_i) \cdot \phi_{GR}(Q, c_i) \qquad (1)$$

This  combination  showed  good  results  in  the 
TREC  2007  blog  distillation  task  [6].  We  use  it  as 
our  baseline  here. 
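
Because both factors are products of small probabilities, the multiplication in the combination is more safely carried out as an addition of log scores. The following is a minimal sketch under that assumption; the function name and the dictionary inputs are hypothetical, and restricting the ranking to blogs that have a Pseudo-cluster Selection score is a design choice of the example.

```python
def combine_log_scores(log_pcs, log_gr):
    """Combine the two factors: log phi_baseline = log phi_PCS + log phi_GR.

    log_pcs and log_gr map blog ids to log scores. Only blogs scored by
    Pseudo-cluster Selection (the basis factor) are ranked.
    """
    combined = {b: log_pcs[b] + log_gr[b] for b in log_pcs if b in log_gr}
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)
```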

3.2  New  Supplementary  Factor 


The combination of Global Representation and
Pseudo-cluster Selection is very effective for both feed
search and blog site search [5]. However, a blog index
is additionally required for Global Representation, as
we mentioned in Section 3.1. In practice, operating
a system with two indexes requires considerable
extra effort. To avoid such a burden, we consider a
new supplementary factor which can substitute for
Global Representation.

A major role of Global Representation in the combination
is penalizing the topical diversity of a blog.
A goal of the blog distillation task is locating blogs
where users can consistently get relevant information
through feed subscriptions. If a user subscribes to a
blog which addresses too many topics, then feeds of
various topics will be delivered from the blog. Therefore,
penalizing topical diversity is necessary for effective
feed search. In Global Representation, if a blog
addresses many topics, the effect of terms related to
one topic tends to be diluted by terms for other topics
in the language model. Accordingly, it penalizes topical
diversity. Our new supplementary factor must be
able to play this role.

Let’s consider randomly sampling several postings
from a blog without regard to relevance. We can compute
the query-likelihood scores of the sampled postings
for a given query. If the blog is topic-focused,
then many of the sampled postings are likely to address
similar topics and have high query-likelihood
scores. On the other hand, if the blog addresses various
topics, the sampled postings may address different
topics and have relatively low query-likelihood
scores. Therefore, we can estimate the topical diversity
of a blog by these randomly sampled postings.
We call a set of the postings sampled from a blog
a topic-independent pseudo-cluster. We can rank
the pseudo-clusters using the same geometric-mean
representation as Pseudo-cluster Selection. We replace
the ranking function of Global Representation
in Equation 1 with the ranking function of this new
pseudo-cluster as follows:

$$\phi_{random}(Q, c_i) = \phi_{PCS}(Q, c_i) \cdot \left( \prod_{j=1}^{M} P(Q \mid r_{ij}) \right)^{\frac{1}{M}}$$

where $r_{ij}$ is the $j$th randomly selected posting of blog
site $c_i$ and $M$ is the number of sampled postings. Seo and
Croft showed this supplementary
factor is comparable to Global Representation [5].

This supplementary factor, however, produces
time-variant results because it is based on random
samples. This time-variant property is not usually
preferred in information retrieval systems in that it
may confuse users. Therefore, we suggest a strategic
sampling approach, i.e. sampling only recently
updated postings instead of sampling postings randomly,
as follows:

$$\phi_{recent}(Q, c_i) = \phi_{PCS}(Q, c_i) \cdot \left( \prod_{j=1}^{M} P(Q \mid r'_{ij}) \right)^{\frac{1}{M}}$$

where $r'_{ij}$ is the $j$th most recent posting of blog site $c_i$.

We believe that this sampling is capable of penalizing
diversity because it can still form a topic-independent
pseudo-cluster. This sampling is sensitive
to temporal aspects of blogs because it prefers
blogs which have recently updated relevant postings.
This property may be beneficial depending on the
type of queries and the goal of tasks. We will look
into how suitable this sampling is for the blog distillation
task.
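
The recency-based supplementary factor can be sketched as below. This is a minimal sketch, not the code used for the submitted runs: the function name, the sample size name `M`, the (timestamp, posting) input format and the scoring callable are assumptions made for the example. When a blog has fewer than `M` postings, the sketch simply averages over the postings that exist, which is also an assumption.

```python
def recency_factor(blog_postings, log_likelihood, M):
    """Log geometric mean of P(Q|r) over the M most recent postings of a blog.

    blog_postings: list of (timestamp, posting) pairs for one blog.
    log_likelihood: callable mapping a posting to log P(Q|posting).
    """
    if not blog_postings:
        raise ValueError("blog has no postings")
    # Most recent postings first; relevance to the query is ignored here,
    # so the sample forms a topic-independent pseudo-cluster.
    recent = sorted(blog_postings, key=lambda tp: tp[0], reverse=True)[:M]
    return sum(log_likelihood(p) for _, p in recent) / len(recent)
```

The value returned here would be added to the log Pseudo-cluster Selection score of the same blog, mirroring the product of the two factors. Unlike random sampling, the result is deterministic for a fixed snapshot of the collection.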


3.3  Query  Expansion 

Query expansion effectively deals with the word mismatch
problem caused by short queries. Since queries
for the blog distillation task are usually short (each title
query of the TREC 2008 blog distillation task topics
consists of about two words), we expect that query
expansion could play an important role in achieving
accurate retrieval. To expand queries, we use the relevance
model by Lavrenko and Croft [3].

Our framework uses the combination of factors
based on two different collections, i.e. the posting
collection and the blog collection. This is somewhat
different from the usual environment of query expansion.
Diaz and Metzler [2] showed that the relevance
model can be improved by using any external collection
that contains more relevant documents than the
original (target) collection. They expanded queries
using the external collection and ran the expanded
queries against the target collection. This approach
seems appropriate in that we also have two indexes.

Our two collections have different characteristics
although they are originally from the same document
collection, i.e. the permalink collection. Particularly,
when we look at a posting or blog in each collection
as a topic unit, a posting is relatively topic-oriented
compared to a blog because a blog usually addresses
multiple topics. If we retrieve the top N documents
from each collection, then topical terms tend to be
more densely distributed in the postings than in the
blogs. Therefore, we set the posting collection as the
external collection which contains more relevant documents
in terms of the mixture of relevance models.
On  the  other  hand,  our  target  collection  is  the  blog 
collection  since  our  goal  is  finding  relevant  blogs  or 
their  feeds.  We  apply  the  mixture  of  relevance  models 
to  this  setting. 
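
To make the expansion step concrete, the sketch below estimates expansion terms from the top postings retrieved from the external (posting) collection. It is a simplified relevance-model estimate in which each term is weighted by the sum over documents of P(w|D) times the normalized P(Q|D); the exact estimator, the smoothing and the way the expanded query is combined with the original query in our runs are not reproduced here. The function name and the input format are hypothetical, and each posting is assumed to be non-empty.

```python
import math
from collections import Counter, defaultdict


def relevance_model_terms(top_postings, num_terms):
    """Estimate expansion terms from feedback postings.

    top_postings: list of (tokens, log P(Q|D)) pairs for the top-ranked
    postings from the external collection.
    Returns the num_terms highest-weighted (term, weight) pairs.
    """
    # Turn log query likelihoods into normalized document weights;
    # subtracting the maximum keeps exp() numerically stable.
    max_log = max(log_p for _, log_p in top_postings)
    weights = [math.exp(log_p - max_log) for _, log_p in top_postings]
    norm = sum(weights)

    p_w = defaultdict(float)
    for (tokens, _), weight in zip(top_postings, weights):
        tf = Counter(tokens)
        for term, count in tf.items():
            # Maximum-likelihood P(w|D), weighted by the document's share.
            p_w[term] += (weight / norm) * (count / len(tokens))
    return sorted(p_w.items(), key=lambda kv: (-kv[1], kv[0]))[:num_terms]
```

The selected terms and their weights would then be issued as a weighted query against the target (blog) collection.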

4  Experiments 

We used the Indri search engine (http://www.lemurproject.org/indri/)
for experiments.
We built two indexes. We concatenated documents
with the same feed ID (FEEDNO) in the permalink
collection into a virtual document and made a blog
index with them. We also made a posting index with
the permalink collection.

We submitted four runs. The first run (UMassBlog1)
used the baseline, i.e. the combination of
Global Representation and Pseudo-cluster Selection.
This run used the titles of TREC topics 1051-1100
as queries. The second run (UMassBlog2) replaced
the ranking function of Global Representation in the
baseline with the ranking function based on the topic-independent
pseudo-cluster composed of recent postings
in each blog as described in Section 3.2. The second
run also used the titles of the topics as queries.
The third run (UMassBlog3) used the same approach as the first run
except that both the titles and the descriptions of the
topics were used as queries. The fourth run (UMassBlog4) used the
query expansion technique suggested in Section 3.3.
The titles of the topics were used as queries for this
run. We performed the third run in order to compare
our query expansion to manual query expansion,
because including terms in the description as query
terms can simulate the effect of manual query expansion.

Our systems have several parameters. Global Representation
has a Dirichlet smoothing parameter for
the language models, i.e. $\mu$. Pseudo-cluster Selection
has two parameters, i.e. $K$ and $\mu$. Query expansion
has two more parameters, i.e. the number of
expanded terms and the number of documents used to
estimate the relevance model. To learn the parameters,
we used the relevance judgments of the TREC
2007  blog  distillation  task.  The  evaluation  measure 
for  the  training  was  MAP  (mean  average  precision). 

5  Results 

The results from our runs are given in Table 1. The
relevance judgments of the TREC 2008 blog distillation
task use 3 grades, i.e. non-relevant (0), relevant
(1) and highly relevant (2). Accordingly, we present
MAP and P@10 (precision at 10) scores for two
cases: one in which a document with
grade ≥ 1 is considered relevant, and one in which
a document with grade ≥ 2 is considered relevant. We also display
NDCG (normalized discounted cumulative gain)
scores. 
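
For reference, the measures above can be sketched for a single topic as follows. This is a minimal sketch rather than the official evaluation tool: the function names and the dictionary of graded judgments are assumptions made for the example, and the NDCG variant shown (linear gain with a log2 rank discount) is one common formulation that may differ from the one used by the track. MAP is the mean of the average precision values over all topics. The ranked list is assumed to contain no duplicate feeds.

```python
import math


def precision_at_k(ranked, grades, k=10, threshold=1):
    """Fraction of the top k results whose grade is at least threshold."""
    hits = sum(1 for d in ranked[:k] if grades.get(d, 0) >= threshold)
    return hits / k


def average_precision(ranked, grades, threshold=1):
    """Average precision for one topic under a binary relevance threshold."""
    relevant = sum(1 for g in grades.values() if g >= threshold)
    if relevant == 0:
        return 0.0
    hits, total = 0, 0.0
    for rank, d in enumerate(ranked, start=1):
        if grades.get(d, 0) >= threshold:
            hits += 1
            total += hits / rank
    return total / relevant


def ndcg(ranked, grades):
    """NDCG with linear gain and log2 discount (an assumed variant)."""
    def dcg(gains):
        return sum(g / math.log2(i + 1) for i, g in enumerate(gains, start=1))

    ideal = dcg(sorted(grades.values(), reverse=True))
    if ideal == 0:
        return 0.0
    return dcg([grades.get(d, 0) for d in ranked]) / ideal
```

Setting `threshold=1` or `threshold=2` reproduces the two relevance cases reported in Table 1.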


                                   rel ≥ 1             rel ≥ 2
Run                                MAP       P@10      MAP       P@10      NDCG
UMassBlog1 (Title)                 0.2520    0.3880    0.2561    0.2400    0.4777
UMassBlog2 (Title)                 0.2587    0.4080 *  0.2515    0.2400    0.4751
UMassBlog3 (Title + Description)   0.2711 *  0.4240 *  0.2772 *  0.2620 *  0.4969 *
UMassBlog4 (Title)                 0.2423 !  0.3900 !  0.2531 !  0.2420 !  0.4629 !

Table 1: Performance of our submitted runs. A * indicates a significant improvement over the baseline
(UMassBlog1). A ! indicates a significant degradation with respect to UMassBlog3 (tested for UMassBlog4 only).


Our new supplementary factor, the topic-independent
pseudo-cluster (UMassBlog2), showed
effectiveness comparable to the baseline (UMassBlog1).
When the relevance level threshold is 1, the
P@10 score of the new factor is better than that
of the baseline. This suggests that the relevance judgments
may depend to some degree on the recency of relevant
postings. Not surprisingly, the run using
descriptions of topics (UMassBlog3) significantly outperformed
the other runs. However, the performance
of our query expansion technique (UMassBlog4) is
somewhat disappointing. It did not show any improvement
over the baseline, and further it was significantly
worse than the manual query expansion
(UMassBlog3). The question “What are the proper
query expansion techniques for our framework?” remains
open.


6  Conclusions 

We performed a follow-up study of our successful
framework used in TREC 2007, blog site search using
resource selection. We introduced a new supplementary
factor to take over the role of Global Representation.
The factor is better suited to real-world implementations.
Further, the performance of this factor
was comparable to or even better than the baseline
in our experiment. We also suggested a query expansion
approach for our framework. However, it
did not demonstrate any improvement over the baseline.
We plan to explore more advanced information
retrieval techniques, including other query expansion
techniques, for our framework in the future.